ABSTRACT


In order to effectively grade persuasive writing, we must be able
to reliably identify and extract argument structures. To do this,
we must classify arguments by their structural roles (e.g., major
claim, claim, and premise). Current approaches to classification
typically rely on statistical models with heavy feature engineering
or on deep neural networks that do not consider prior knowledge or
other secondary features. Little research has investigated whether
features can be incorporated into deep models to address argument
mining (AM) tasks. In this work, we propose to incorporate
lightweight features into deep models to classify argument
components. We experimented with two state-of-the-art (SOTA)
approaches: 1) linear Long Short-Term Memory (LSTM) models with
concatenated feature vectors; and 2) Directed Acyclic Graph (DAG)
structured LSTMs. In our models we incorporated the features of
argument position (e.g., whether the argument is in the first
paragraph) and prior knowledge of discourse indicators (e.g., "in
conclusion", "for example"). We use two baselines in our work:
1) prior work using SVM models with heavy feature engineering; and
2) traditional linear-Bi-LSTMs with no task-specific features.


Our results show that with a comparatively small number of
lightweight features, both linear-Bi-LSTMs and DAG-Bi-LSTMs
outperform SVM models that depend on heavier feature engineering,
and outperform linear-Bi-LSTMs with only general word embedding
features. These results suggest that incorporating task-specific
elements into deep models may benefit argument mining tasks.
1. INTRODUCTION


Current automated essay grading systems typically focus on the
syntactic and semantic analysis of written arguments via Natural
Language Processing (NLP) techniques (as in [23]). These systems
are typically designed to evaluate arguments on the basis of:
general readability (e.g., the number of prepositions and relative
pronouns or the complexity of the sentence structure); shallow
semantic analysis (e.g., lexical semantics or the analysis of the
relationship among named entities); and syntax analysis (e.g.,
grammatical analysis). To the extent that argument structures are
considered, this work has focused on the limited identification of
individual components (e.g., hypothesis statements [4]) or on
manual analysis by human experts [14], which is costly and
time-consuming. Few existing systems perform any automatic analysis
of the argumentative structures or seek to identify structural
flaws, due to the lack of an auto-extraction mechanism in the
system.


In order to parse arguments it is necessary to extract the basic
components. Extracting argument structures (EAS) is one of the
essential tasks of argumentation mining (AM). EAS can be divided
into three sub-tasks: 1) argument component identification (ACI),
breaking down the text into argument units; 2) argument component
classification (ACC), classifying argument components (ACs) into
types; and 3) argument relation identification (ARI), identifying
the relationships between each pair of ACs. Prior researchers have
focused on different subsets of these tasks (e.g., addressed ACI,
ACC, and ARI separately, or jointly modeled ACI and ACC) or built
end-to-end models that address them sequentially (e.g., [18]). Our
goal in this work, by contrast, is to investigate how to
incorporate task-specific features into deep learning models and
whether those features can improve our models' performance on the
task of ACC. We carried out our work using an argumentation schema
developed by Stab and Gurevych on a corpus of 402 persuasive essays
(PE) [27]. As part of this work, we replicated their work on ACC
and used it as a baseline model.


Most current approaches to ACC either rely on heavy feature engi- 
neering or use deep models that only consider pre- 
trained word embeddings with no other secondary features 
[1i). Little research to date has been focused on incorporating prior 
knowledge or lightweight features into deep models for AM tasks. 
To the best of our knowledge, Lugini and Litman carried out the 
only work that adds features into LSTM based models to address 
ACC on the argument dataset of classroom discussions (13). In that 
work, they considered a set of features including semantic-density
features (e.g., the number of pronouns), lexical features (e.g.,
unigrams and bigrams), and syntactic features from part-of-speech
tags. They combined the feature matrix with the LSTMs' hidden
output for classification. They showed that the features boosted
the deep model performance.


Figure 1: An example of a DAG-LSTM modeling an AC.

In our work, we investigated whether or not it is possible to
incorporate lightweight features into deep models to address ACC on
the PE dataset with different feature sets and deep models. We drew
on prior research on discourse indicators and position features.
Discourse indicators have been shown to be useful features for
identifying the argumentative sections of online product reviews
[30]. Researchers have also demonstrated that the structural
features of AC position (e.g., whether the AC appears in the
introduction, or whether an AC is the first sentence of a
paragraph) and token statistics (e.g., the number of tokens in an
AC) are the most effective features for AC classification [27].
For this study we only considered the position information for
ACs. We chose to focus on these two features because they are the
most informative and also require the least amount of feature
engineering.


For our deep models, we experimented with Bi-LSTMs. We en- 
coded the incorporated features by one-hot encoding, computed 
the element-wise summation of feature vectors, and then combined 
feature vectors with Bi-LSTMs output for prediction. This ap- 
proach is the same as the work in (13). We also considered bidi- 
rectional DAG-Structured Recurrent Neural Networks (RNNs) to 
incorporate features. DAG-RNNs, also known as Neural Lattice 
Language models, are an extension of linear-chained RNN models 
that can consume DAG-structured input. If we treat the text
as a linear path, the prior knowledge and secondary features can be 
added as new edges on the path to form a DAG structure. The dis- 
course features are connected to the parent- and child- nodes of the 
related tokens, similar to the work of [32]. For position features,
we simply connected them with the two special sequence delimiters
which indicate the beginning and end of the sentences. Figure 1
shows an example of DAG input with the discourse indicators
“FROM_THIS_POINT_OF_VIEW” and “I_FIRMLY_BELIEVE_THAT” in red and
the position features “IN_INTRODUCTION” and “IS_LAST_SENTENCE” in
green. The original input is in the blue nodes. The tokens “-B-”
and “-E-” are special sequence delimiters. The nodes are indexed
in topological order, as this is the
order in which the one-directional DAG model consumes the input 
sequence. For bidirectional DAG models, we simply reverse the 
order. The intuition behind this approach is that it mimics how hu- 
mans read and annotate essays as humans can incorporate linguistic 
intuition to determine the role of the ACs in written argumentation. 
For example, if a sentence appears to be the last sentence of the 
introduction in a five-paragraph essay, it most likely contains the 
author’s standpoint, i.e., claim. 
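
The lattice construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the helper names `build_dag` and `topological_order`, the span convention (token indices, end exclusive), and the example sentence are all hypothetical. Feature nodes are added as side branches next to the linear token path, and the standard-library `graphlib` module supplies a topological visiting order.

```python
from graphlib import TopologicalSorter


def build_dag(tokens, span_features, sentence_features):
    """Build a lattice over one argument component (AC).

    tokens: the AC's word tokens, without delimiters.
    span_features: (start, end, label) triples over `tokens`, end exclusive.
    sentence_features: labels that describe the whole AC.
    Returns (labels, parents), where parents[i] is the set of parent node ids.
    """
    labels = ["-B-"] + list(tokens) + ["-E-"]
    end_node = len(labels) - 1
    # Linear path: every node after -B- has the previous node as its parent.
    parents = {i: ({i - 1} if i > 0 else set()) for i in range(len(labels))}

    def add_feature(label, parent, child):
        node = len(labels)
        labels.append(label)
        parents[node] = {parent}
        parents[child].add(node)

    # A discourse indicator runs parallel to the tokens it covers. Token k sits
    # at node k + 1, so the node before the span is `start` and the node after
    # the span is `end + 1`.
    for start, end, label in span_features:
        add_feature(label, parent=start, child=end + 1)
    # A position feature runs parallel to the whole sequence (-B- to -E-).
    for label in sentence_features:
        add_feature(label, parent=0, child=end_node)
    return labels, parents


def topological_order(parents):
    """One valid order in which a forward DAG model could visit the nodes."""
    return list(TopologicalSorter(parents).static_order())


tokens = "i firmly believe that zoos should be closed".split()
labels, parents = build_dag(
    tokens,
    span_features=[(0, 4, "I_FIRMLY_BELIEVE_THAT")],
    sentence_features=["IN_INTRODUCTION", "IS_LAST_SENTENCE"],
)
order = topological_order(parents)
backward_order = list(reversed(order))  # visiting order for the reverse direction
```

In this sketch the node for "zoos" ends up with two parents, the token "that" and the indicator node, which is the situation where a DAG model has to merge several incoming states.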


DAG-RNNs have been used to incorporate linguistic knowledge (e.g.,
non-compositional phrases in the form of n-grams) for sentiment
classification [32]. They have also achieved SOTA results in many
other NLP tasks such as neural machine translation [28], speech
translation [25], and language modeling [2]. In this work, we
utilized LSTMs, a special kind of RNN. Building on Zhu et al.'s
work [32], we implemented a DAG-LSTM in TensorFlow with a
different hidden state merge function (discussed in Section 4).


Our results show that linear-Bi-LSTMs with no task-specific
features performed worse than traditional models. However, once we
incorporated our two features, both the linear-Bi-LSTMs and
DAG-Bi-LSTMs outperformed the general Bi-LSTMs with no features as
well as the models that rely on heavy feature engineering.
DAG-Bi-LSTMs slightly outperformed the linear-Bi-LSTMs when
considering both features. The linear-Bi-LSTMs with only position
features yielded the best results.


The significance of this work is as follows. 1) Our work serves as 
the basis for automated essay grading systems, and can be applied 
to extract argument structures for detecting structural flaws. 2) We 
addressed a common issue in NLP that deep models tend to yield 
lower performance on small datasets. We showed that deep models 
can benefit from lightweight features and yield better performance. 
3) We experimented with DAG-LSTMs to incorporate features on 
text classification tasks. We showed that it could be a promising 
architecture to incorporate features into sequence models. 4) We 
tested the same approach used in [13] to combine features with
LSTMs on a different dataset. Our results are consistent with their 
work. 


2. RELATED WORK 
2.1 AC Classification 


Most of the prior work on AC classification relies on traditional 
classification models and heavy feature engineering. In [27], re- 
searchers applied multiclass SVMs to classify ACs using struc- 
tural, lexical, syntactic, discourse indicator, and contextual fea- 
tures. They obtained an F1 score of 0.794 on the PE corpus. In 
and (18), authors performed classification task on a small portion of 
the PE dataset, again relying on extracted features. In [10], Namhee
et al. analyzed online comments to identify and classify subjec- 
tive claims using lexical and syntactic features. In (16), researchers 
worked to classify ACs on legal documents using extracted features 
while Niall et al. applied kernel methods for argument detection 
and classification on AraucariaDB dataset (22). Different from the 
above, we only considered prior knowledge of discourse indicators 
and structural information of position features. 


Many researchers have begun to explore the application of deep 
neural-network models to argument mining. In (ij. authors exper- 
imented with CNN and RNN models to detect the claims and 
evidences on Wikipedia datasets. Potash et al. proposed a joint 
sequence-to-sequence model with attention to predict the links
between ACs and classify ACs on the PE dataset, where they
considered the sequential nature of ACs [19]. In our work, by contrast, we
focused on incorporating prior knowledge to deep neural-networks 
to classify the ACs alone. 


To the best of our knowledge, the research in [13] is the only work
that combines feature engineering and deep models to address ACC
on classroom discussions. They showed that SOTA deep models with
only pre-trained embeddings performed poorly on their dataset.
However, by including secondary features they improved the
performance substantially. The features included: semantic-density
features (e.g., number of pronouns, descriptive word-level
statistics, number of occurrences of words of different lengths),
lexical features (e.g., tf-idf features for each unigram and
bigram, descriptive argument move-level statistics), and syntactic
features (e.g., unigrams, bigrams, and trigrams of part-of-speech
tags). In their work, they experimented with convolutional neural
network models and LSTMs. Their results showed that the models'
performance improved after adding the secondary features. In our
work, we considered the same approach to incorporate features into
linear-LSTMs and compared the model performance with DAG-LSTMs.


2.2 DAG-RNNs 

DAG-RNNs, also known as Neural Lattice Language (NLL) models, are
extensions of chain-structured RNNs [32], [28]. These models, first
proposed by Zhu et al. in [32], leverage DAG structures to
incorporate external semantics such as n-gram sentiment tags and
expert annotations to improve performance on sentiment
classification. Su et al. introduced NLL-based Gated Recurrent
Units (GRUs) to encode multiple word segmentations of Chinese text
for translation [28]. Sperber et al. later used NLL-based GRU
models to consume word lattices from the upstream models of the
speech recognizer for speech translation [25]. These lattices were
annotated with posterior probabilities on the alternative
translation paths. Finally, in [2], researchers demonstrated that
the NLL models outperformed the LSTM-based models at the task of
language modeling when incorporating multi-word phrases (n-grams)
and multiple embeddings for polysemy. However, little research has
been done to utilize DAG-RNNs to integrate features for AM tasks.


3. DATASET 

The PE dataset was developed by Stab and Gurevych [27]. It contains
402 essays from the online community essayforum.¹ The forum
provides writing feedback for different kinds of text. Students
can post practice essays for standardized tests in the community
and obtain feedback about their writing skills. The essays were
randomly selected from the writing feedback section of the forum.
The dataset is annotated with three argument components: the major
claim, indicating the author's standpoint on the given
controversial topic; claims, sub-standpoints that support (“for”)
or attack (“against”) the major claim; and premises, the reasons
of the argument that support or attack a claim. Table 1 shows the
class distribution of the PE corpus. The average numbers of tokens
in major claims, claims, and premises are 19, 23, and 21,
respectively.


              Major Claim   Claim   Premise
Train & Dev           598    1202      3023
Test                  153     304       809


Table 1: Number of instances in each class.


¹ https://essayforum.com/

Figure 2: A unit of the DAG-LSTM.
4. METHODS 


4.1 Linear-Bi-LSTMs 

In traditional Bi-LSTM models, ACs are encoded using GloVe
embeddings that are obtained from training on large Wikipedia
datasets. The encoded ACs are fed to Bi-LSTMs, and the last hidden
states are passed to the softmax layer for prediction. To
incorporate features, we used a one-hot vector to represent each
feature and summed up the feature vectors that are related to an
AC. For example, suppose we have a total of three features in the
feature space: “if an AC is in the introduction”, “if an AC is in
the conclusion”, and “if an AC contains the discourse indicator
‘in conclusion’”. We use one-hot vectors to represent the three
features as [0, 0, 1], [0, 1, 0], and [1, 0, 0], respectively.
When an example AC contains the discourse indicator “in
conclusion” and is in the conclusion paragraph, we sum the two
feature vectors element-wise to get a vector of [1, 1, 0], which
represents the features for this AC. Then we concatenate this
vector with the hidden output of the Bi-LSTMs for final
prediction. The same approach has been used in [13].
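
As a minimal sketch of this encoding, the snippet below sums one-hot vectors and appends the result to a hidden vector. The feature inventory, its ordering, and the helper names (`one_hot`, `feature_vector`, `classifier_input`) are hypothetical, and the two-dimensional hidden vector stands in for the real Bi-LSTM output.

```python
# Hypothetical feature inventory; the ordering only fixes which slot each
# feature occupies in the vector.
FEATURES = ["IN_INTRODUCTION", "IN_CONCLUSION", "IS_FIRST_SENTENCE", "IS_LAST_SENTENCE"]


def one_hot(name):
    """One-hot vector for a single feature name."""
    return [1 if name == feature else 0 for feature in FEATURES]


def feature_vector(active):
    """Element-wise sum of the one-hot vectors of all features active for an AC."""
    total = [0] * len(FEATURES)
    for name in active:
        total = [a + b for a, b in zip(total, one_hot(name))]
    return total


def classifier_input(hidden, active):
    """Concatenate the last hidden output with the summed feature vector."""
    return list(hidden) + feature_vector(active)


active = ["IN_CONCLUSION", "IS_LAST_SENTENCE"]
vec = feature_vector(active)
x = classifier_input([0.2, -0.1], active)
```

Because the features enter only at the classifier input, this variant leaves the recurrent layers untouched, which keeps it simple but means the LSTM itself never sees the features.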


4.2 DAG-Bi-LSTMs 

We implemented the DAG-Bi-LSTMs using the TensorFlow platform.²
The DAG-Bi-LSTMs in our work are similar to the models described
in [32]. However, we applied a different hidden state merge
function. While Zhu et al. used binarization, we elected to sum
the parent hidden states as suggested in the TreeLSTM work [29].
Intuitively, by summing the previous states, we expect the DAG
models to learn both the summarized linear history and the
incorporated knowledge. We used the same one-hot method to encode
the incorporated features.


For sequential inputs, the linear-LSTM models calculate the hidden
state $h_t$ and cell state $c_t$ based upon the preceding hidden
state $h_{t-1}$, the preceding cell state $c_{t-1}$, and the input
embedding $e_t$ of token $x_t$ as:


$$h_t, c_t = \mathrm{LSTM}(h_{t-1}, c_{t-1}, e_t, \theta) \qquad (1)$$


The primary difference between DAG- and linear-LSTMs is that the
former can have multiple parent and child states, as shown in
Figure 2, while the latter cannot. Given a DAG input, $h_{p_i}$
indicates one of a set of parent states at time step $t$, where
$i = 0, 1, \ldots, n$ and $p_i \in P_t$. The DAG model first
gathers its parent hidden states $h_{p_i}$ and then sums over the
parents' hidden states and over the parents' cell states as
follows:


$$h_{P_t} = \sum_{p_i \in P_t} h_{p_i}, \qquad c_{P_t} = \sum_{p_i \in P_t} c_{p_i} \qquad (2)$$


² We also experimented with DAG-Bi-Gated Recurrent Units, but
DAG-Bi-LSTMs yielded better results.
The remainder of the DAG process is similar to that of
linear-Bi-LSTMs in that $h_{P_t}$, $c_{P_t}$, and $e_t$ are fed to
a standard LSTM unit to generate the new hidden and cell states
$h_t$ and $c_t$, which are then copied to the child states.
Finally, the last hidden states are fed to Multi-Layered
Perceptrons (MLPs) for prediction.
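
The forward pass just described can be illustrated with a small pure-Python sketch. It is not the TensorFlow implementation used in this work: the gate parameterization, the helper names (`lstm_cell`, `sum_states`, `dag_lstm_forward`, `random_params`), and the toy dimensions are assumptions made for illustration. Each node sums the hidden and cell states of its parents and passes the sums, together with its own embedding, to a standard LSTM update.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, x):
    """Matrix-vector product plus bias, on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def lstm_cell(h_prev, c_prev, e, params):
    """Standard LSTM update: returns the new (hidden, cell) state."""
    x = list(h_prev) + list(e)
    i = [sigmoid(v) for v in affine(*params["input"], x)]
    f = [sigmoid(v) for v in affine(*params["forget"], x)]
    o = [sigmoid(v) for v in affine(*params["output"], x)]
    g = [math.tanh(v) for v in affine(*params["candidate"], x)]
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def sum_states(states, size):
    """Element-wise sum of parent states; all zeros when a node has no parents."""
    total = [0.0] * size
    for state in states:
        total = [a + b for a, b in zip(total, state)]
    return total


def dag_lstm_forward(order, parents, embeddings, params, size):
    """Visit nodes in topological order, merging parent states by summation."""
    h, c = {}, {}
    for node in order:
        h_in = sum_states([h[p] for p in parents[node]], size)
        c_in = sum_states([c[p] for p in parents[node]], size)
        h[node], c[node] = lstm_cell(h_in, c_in, embeddings[node], params)
    return h, c


def random_params(size, emb_dim, seed=0):
    """Toy parameters: one (weights, bias) pair per gate."""
    rng = random.Random(seed)

    def gate():
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(size + emb_dim)]
                   for _ in range(size)]
        return weights, [0.0] * size

    return {name: gate() for name in ("input", "forget", "output", "candidate")}


size, emb_dim = 3, 2
params = random_params(size, emb_dim)
rng = random.Random(1)
parents = {0: [], 1: [0], 2: [0], 3: [1, 2]}  # node 3 merges two branches
embeddings = {n: [rng.uniform(-1, 1) for _ in range(emb_dim)] for n in parents}
h, c = dag_lstm_forward([0, 1, 2, 3], parents, embeddings, params, size)
```

When every node has exactly one parent, the loop reduces to an ordinary chain LSTM, so the linear model can be viewed as a special case of this sketch.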


4.3 Prior Knowledge and Features 

In this work, we considered the prior knowledge of discourse
indicators and the AC position features. In prior work [27], Stab
and Gurevych collected a list of hand-crafted features for ACC
tasks, including lexical, structural, discourse indicator,
contextual, and syntactic features, among others. The detailed
explanations of the features can be found in Section 5.3.1 of
[27]. The discourse indicators fall into five categories: forward
indicators (e.g., “therefore”); backward indicators (e.g.,
“because”); thesis indicators (e.g., “in my opinion”); rebuttal
indicators (e.g., “although”); and first-person indicators (e.g.,
“I”, “me”). For the position features, we annotated sentences if
they were the first/last sentence and if they showed up in the
first/last paragraph. The annotations are in a special n-gram form
(e.g., “IS_LAST_SENTENCE”, “_THEREFORE_”) so that they are
distinguished from the original corpus. We also experimented with
annotating the discourse indicators by category, such that the
forward indicators were annotated as “FORWARD_INDICATORS”.
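
The annotation step can be sketched as follows. The lexicon below contains only the example indicators quoted above, not the full list from [27], and the function names as well as the tag names other than “IS_LAST_SENTENCE”, “_THEREFORE_”, and “FORWARD_INDICATORS” are hypothetical placeholders chosen for illustration.

```python
import re

# Tiny illustrative lexicon built from the examples in the text.
INDICATORS = {
    "therefore": "forward",
    "because": "backward",
    "in my opinion": "thesis",
    "although": "rebuttal",
    "i": "first_person",
    "me": "first_person",
}


def position_tags(sent_idx, n_sents, para_idx, n_paras):
    """Position annotations for a sentence, given 0-based indices."""
    tags = []
    if sent_idx == 0:
        tags.append("IS_FIRST_SENTENCE")
    if sent_idx == n_sents - 1:
        tags.append("IS_LAST_SENTENCE")
    if para_idx == 0:
        tags.append("IN_INTRODUCTION")
    if para_idx == n_paras - 1:
        tags.append("IN_CONCLUSION")
    return tags


def discourse_tags(text, by_category=False):
    """Discourse annotations, either one tag per n-gram or one per category."""
    lowered = text.lower()
    tags = []
    for phrase, category in INDICATORS.items():
        # Whole-word match so that "i" does not fire inside "in" or "it".
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            if by_category:
                tag = category.upper() + "_INDICATORS"
            else:
                tag = "_" + phrase.upper().replace(" ", "_") + "_"
            if tag not in tags:
                tags.append(tag)
    return tags


sentence = "Therefore, in my opinion zoos should close."
ngram_tags = discourse_tags(sentence)
category_tags = discourse_tags(sentence, by_category=True)
pos_tags = position_tags(sent_idx=2, n_sents=3, para_idx=0, n_paras=5)
```

The category variant maps many surface forms onto five tags, so each tag is seen far more often during training than any individual n-gram tag.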


5. EXPERIMENTAL SETUP 


We carried out a series of experiments using the same static train- 
ing/testing split as in prior work (27). Since the corpus does not 
have a designated development set, we used stratified sampling to 
select 15% of the training set to tune our hyper-parameters and re- 
ported our final results on the designated test set. We ran each 
experiment five times and reported the average Macro-F1 score of 
the test dataset. 
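
A stratified development split of this kind can be sketched with the standard library alone. The function name `stratified_split`, the seed, and the rounding rule are assumptions; the class sizes in the usage example are the Train & Dev counts from Table 1.

```python
import random
from collections import defaultdict


def stratified_split(labels, dev_fraction=0.15, seed=0):
    """Split example indices into (train, dev), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, dev = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_dev = round(len(idxs) * dev_fraction)  # per-class dev size
        dev.extend(idxs[:n_dev])
        train.extend(idxs[n_dev:])
    return sorted(train), sorted(dev)


labels = ["MajorClaim"] * 598 + ["Claim"] * 1202 + ["Premise"] * 3023
train_idx, dev_idx = stratified_split(labels)
dev_counts = {cls: sum(labels[i] == cls for i in dev_idx) for cls in set(labels)}
```

Sampling per class keeps the proportion of major claims, claims, and premises in the development set close to that of the training data, which matters here because premises heavily outnumber the other two classes.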


We carried out four distinct experiments: Base-SVMs, which
replicates the work in [27] with multiclass SVMs using polynomial
kernels on a set of features; Base-LSTMs, which are baseline
models of general Bi-LSTMs with no secondary features; and LSTMs
and DAG-LSTMs, which refer to Bi-LSTMs and DAG-Bi-LSTMs with
task-specific features.


We used a grid search for hyper-parameter tuning, and we used the
same set of parameters across all the models. We used
300-dimensional GloVe embeddings [17]. Tokens not present in the
pre-trained embeddings or not features were randomly initialized
with uniform samples from the range
$[-\sqrt{3/dim}, \sqrt{3/dim}]$, where $dim$ is the dimension of
the embeddings (300). All of the tokens that appear in the test
and dev sets but not in the training set share one unique random
embedding. The embeddings were fixed during training. We then used
the Adam optimization algorithm with a learning rate of 0.005, a
batch size of 32, a layer LSTM with a hidden size of 64 and a
dropout rate of 0.2, and a layer tanh-MLP with a hidden size of
64.


6. RESULTS & DISCUSSION 


In this section, we first discuss the overall results. We then
examine how each feature impacts the linear-LSTMs and DAG-LSTMs by
comparing them with traditional SVMs and linear-LSTMs with no
task-specific features. Finally, we compare the performance of
linear-LSTMs and DAG-LSTMs.


6.1 Overall Results 


Table 2 shows the results of each experiment and our baseline
metrics. The standard deviations of the deep model results are all
less than 0.009 over the runs. The first three columns show our
benchmark, Stab and Gurevych's results with SVMs on: all features,
structural features alone (including AC position in the document
and token statistics), and contextual features alone (including
discourse indicators and the number of noun and verb phrases in an
AC). The next two columns show our two baseline models: Base-SVMs
and Base-LSTMs. The remaining columns show the linear-LSTMs and
DAG-LSTMs with the two features together and separately. Pos
indicates that the position features were considered, Dis
indicates that the discourse indicators were incorporated, and
Pos-Dis indicates that both features were used.
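
For readers unfamiliar with the metric, macro-F1 is the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of its size. The sketch below uses hypothetical helper names (`f1_per_class`, `macro_f1`) and a made-up four-example prediction; the final lines recompute the Base-LSTMs macro-F1 from the per-class scores reported in Table 2.

```python
def f1_per_class(gold, pred, classes):
    """Per-class F1 from parallel lists of gold and predicted labels."""
    scores = {}
    for cls in classes:
        tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
        fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
        fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores[cls] = 2 * precision * recall / denom if denom else 0.0
    return scores


def macro_f1(per_class):
    """Unweighted mean of the per-class F1 scores."""
    return sum(per_class.values()) / len(per_class)


classes = ["MajorClaim", "Claim", "Premise"]
gold = ["MajorClaim", "Claim", "Premise", "Premise"]
pred = ["MajorClaim", "Premise", "Premise", "Premise"]
toy_scores = f1_per_class(gold, pred, classes)
toy_macro = macro_f1(toy_scores)

# Per-class F1 of the Base-LSTMs as reported in Table 2.
base_lstm = {"MajorClaim": 0.625, "Claim": 0.455, "Premise": 0.819}
base_lstm_macro = round(macro_f1(base_lstm), 3)
```

The recomputed value matches the 0.633 reported for Base-LSTMs, and the low claim score shows how a single weak class pulls the macro average down.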


Overall, linear-LSTMs with only position features return the best
macro-F1 score of 0.805 across the board. DAG-LSTMs with both
position and discourse features return a very close score of
0.802. Drilling down, linear-LSTMs with position features also
yield the best F1 score for claim and premise components. For
claim components in particular, the F1 score increases by 23
percentage points over the Base-LSTMs and by 4.5 percentage points
over the Base-SVMs. For major claims, the SVMs from prior work
still have the best F1 score.


For traditional models with heavy feature engineering, our
Base-SVMs are close to Stab and Gurevych's result (SVMs) but with
a lower F1 score on major claims. This may be due to minor
differences in our feature extraction or the different
experimental setting. Among the three deep models, the Base-LSTMs
with only pre-trained word embeddings perform very poorly, and all
of their F1 scores are much lower than those of the SVM models.
This is not very surprising given the small amount of data: the
trained models do not generalize well to the test data. This may
also be due to the fact that the pre-trained embeddings are
obtained from training models on a large corpus of Wikipedia data
[17], which can be thought of as prior knowledge. However,
Wikipedia is very different from the PEs: the writing is generally
more formal, it is a product of collaborative work, and it is
heavily edited. PEs are most likely composed by non-native English
writers. Thus the Base-LSTMs with GloVe embeddings may not be able
to capture the semantic meaning in the PEs.


However, once we incorporated the position and discourse features,
we obtained a very high macro-F1 score. We obtained macro-F1
scores of 0.801 and 0.802 for the linear and DAG models, an
improvement of roughly 17 percentage points over Base-LSTMs with
no features (0.633). These results imply that both models can
utilize task-specific features to boost performance. When we
compare the deep models with the SVMs based on heavy feature
engineering, we find that the deep models with only two features
outperformed the SVMs, which suggests that combining a few
features with deep models can improve learning on AM tasks.


6.2 Position Features 

When adding position features, both the linear models and DAG
models outperformed the SVMs and Base-LSTMs. The linear models
return the best results. In fact, after incorporating position
features, the major claims that were previously misclassified as
premises by the Base-LSTMs were classified as either claims or
major claims by the linear and DAG models. However, while adding
the position features improves the identification of claims and
major claims, the models are not good at distinguishing the two.
Looking deeper, the resulting models tend to be biased towards the
major claim classification, especially for claims that show up in
the introduction or conclusion. One possible explanation for this
is that the position features dominate the feature space in the PE
dataset; models pay too much attention to the position features
and too little attention to the semantic context.


             SVMs (Stab et al.)   Base-SVMs  Base-LSTMs  LSTMs                  DAG-LSTMs
Features     All    Pos    Dis    All        None        Pos-Dis  Pos    Dis    Pos-Dis  Pos    Dis
Major claim  .891   .803   .656   .844       .625        .832     .823   .650   .852*    .845   .645
Claim        .611   .551   .248   .640       .455        .670     .685*  .428   .659     .668   .498
Premise      .879   .870   .836   .886       .819        .903     .907*  .820   .896     .886   .797
Macro-F1     .794   .741   .580   .790       .633        .801*    .805*  .633   .802     .799   .647


Table 2: Results on classifying ACs on the PE corpus. The “All”
column refers to the eight types of features in [27]. “Pos”
columns indicate the models with position features. “Dis” columns
show the model results with discourse features. * means
significant improvement over Linear-LSTMs with no features. Bold
indicates the best result per row.


Our results are consistent with prior work which has suggested that 
position features play an important role in classifying ACs on the 
PE dataset. As shown in Table 2, with only position features, the
traditional SVMs can reach a macro F1 score of 0.741. The utility 
of this feature is relatively intuitive. In five-paragraph essays, the 
major claims usually show up in the first or last paragraphs. And 
in our PE dataset, 70% of the major claims were either the last 
sentence of the introduction or in the first sentence of conclusions, 
while 67% of key subsidiary claims show up in the first or last 
sentence of the middle paragraphs. 


Our results also suggest that we can incorporate non-semantic
features into deep models to help the model learn, especially when
these non-semantic features cannot be captured by the word
embedding features.


6.3 Discourse Indicator features 

Interestingly, the performance of the linear models was not
improved after adding the discourse features, and the DAG model's
performance improved only slightly. When we examine the data more
closely, we see that one possible reason is that the current
discourse indicator list provided in [27] does not cover all the
cases in the PE corpus. We identified more discourse indicators
that are not included in the list, such as the thesis indicators
“it is believed that”, “to summarise”, “in short”, “it is
undeniable”, and “I admit that” and the forward indicator “based
on the above discussion”. We will address this problem in our
future work.


We also experimented with incorporating discourse features by
category. We ran another set of experiments with the same model
parameter settings as above. Table 3 shows the results.


               LSTMs              DAG-LSTMs
Features       Pos-Dis   Dis      Pos-Dis   Dis
Major Claim    .843*     .662*    .859*     .675*
Claim          .666      .475*    .659      .506*
Premise        .899      .817     .894      .794
Macro-F1       .803*     .651*    .804*     .659*


Table 3: Results for incorporating discourse indicators as
annotation types for Linear-LSTMs and DAG-LSTMs. * means
improvement over the results shown in Table 2.


After incorporating discourse indicator features by category, the
models' performance improved. In Table 3, * indicates an
improvement over the results from the n-gram discourse features.
Both the linear and DAG models yield slightly better results on
major claim and claim components, especially the major claim.
Looking more closely at the results, some major claims that were
previously misclassified as claims are correctly predicted here.
The reason could be that when we consider the discourse indicators
by category, we only have five features, and we have sufficient
training examples for each of them. In fact, the thesis,
first-person, forward, rebuttal, and backward indicators show up
701, 1535, 968, 719, and 1769 times, respectively, in the total
data. Thus, it is easier for the models to capture the discourse
features. However, when we considered them in the n-gram form, we
had more than 100 of them, and each shows up in the ACs far less
often. Some of them only show up once in the entire data, such as
the thesis indicator “all things considered” and the backward
indicator “is due to the fact that”, which prevents the models
from learning useful information.


Overall, the discourse indicator features did not improve the
models' performance substantially, which may indicate that deep
models with pre-trained embeddings already capture this semantic
information. Thus, adding discourse indicator features does not
help as much as position features on the PE dataset.


6.4 Linear-LSTMs vs DAG-LSTMs 

When adding both classes of features, the linear and DAG models
yielded similar macro-F1 scores. However, the linear models
outperformed the DAG models with the position features alone, and
the DAG models utilized the discourse indicator features better in
both Table 2 and Table 3. One possible explanation is that when we
concatenate the feature vectors with the last hidden output of the
linear models, the models are not able to learn the interactions
between the discourse indicators and the surrounding words. The
DAG models, by contrast, can learn those interactions by merging
the hidden annotation states with the current hidden states, and
the current states contain the semantic information from all
previous tokens. For example, for the DAG input shown in Figure 1,
we use the index of the DAG input to refer to the time step at
which the LSTM unit processes the hidden state. At time step 15 of
the forward training, we first merge the hidden output of state 10
and state 14, and then pass the merged hidden state as the hidden
input of state 15. In this way, the DAG models consider both the
semantic meaning of all previous tokens and the discourse feature
“I firmly believe that”, and pass that information to the next
hidden state. These results imply that DAG models tend to utilize
the semantic features better, as they can learn the interaction
between the features and tokens. However, they might not perform
very well when we consider non-semantic features, such as the
position features used here. One possible approach to address this
problem is to combine the two proposed methods of incorporating
features: we can use DAG models to incorporate discourse indicator
features and then concatenate the one-hot position feature vectors
with the final hidden states for prediction.


7. CONCLUSIONS 


In this work, we experimented with two approaches to combine fea- 
ture engineering with deep models: linear-Bi-LSTMs with feature 
vectors concatenated on the hidden output and DAG-Bi-LSTMs. 
Our results show that both deep models could benefit from task- 
specific features, as both of them outperformed traditional mod- 
els with heavy feature engineering and deep models with no task- 
specific features. We also show that the linear models handle the 
position feature better, and that the DAG models utilize the seman- 
tic features better since the linear models can not learn the interac- 
tion between discourse features and tokens. And finally we show 
that the deep models benefit more from position features than dis- 
course indicator features on the PE dataset. Our results imply that 
when we apply the deep learning models to classify ACs, we could 
consider utilizing some task-specific features to guide the model 
learning. 


This work can serve as a basis for the development of structurally- 
aware support platforms for reading and writing. This can include 
automated essay grading systems that detect and evaluate structural 
deficiencies as well as writing tutors that scaffold the construction 
of coherent essays or identify structural issues. As discussed in 
Section[I] current automated grading systems suffer from the lack 
of reliable auto-extraction mechanisms with most still relying on 
traditional ML models that use heavy feature engineering to func- 
tion. Such work is costly and time consuming to develop and may 
not always generalize to other essay types. Our work addresses 
this problem by showing that lightweight features and off the shelf 
methods can outperform those methods. At the same time our 
work also showed that while traditional machine learning models 
are costly and deep learning models are sensitive to small datasets, 
as discussed by [8], [31], this limitation too can be addressed
through the use of lightweight feature work to guide the deep
models. By addressing these two problems we have shown a path for
developing robust argument detection mechanisms for automated
educational platforms using novel deep learning approaches, a path
that can lead to substantive improvements for students and
educators.


8. FUTURE WORK 


These preliminary results serve as a basis for our ongoing research, 
in which we are building an end-to-end model with feature engi- 
neering to address all three sub-tasks for argument structure ex- 
traction. For that work we will frame this task as sequence tag- 
ging problems. We propose to use linear-LSTM and DAG-LSTM 
based models with task-specific features to address EAS. We es- 
timate that incorporating the task-specific features into end-to-end 
models can improve the model’s performance compared to the deep 
models based on general word embedding (6). 


In future work, we will also consider experimenting with these two
approaches on different argumentation datasets, and compare the
results with fine-tuned SOTA language models (e.g., BERT [5],
T5 [20]).